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Abstract — A major problem with using relational 
databases, is writing efficient SQL queries. Some common 
errors known as anti-patterns are frequent in SQL queries 
and can seriously impact the query execution time and 
sometimes, the database general performance. This paper 
shows how ma-chine learning techniques can be lever¬ 
aged to detect anti-patterns in SQL queries by 
approaching the problem as a text classification problem. 
Our result is a model based on a convolutional neural net¬ 
work that can be used to classify a SQL query into zero, 
one or many anti-patterns classes. 

Keywords — SQL, relational database, text classification 
techniques . 1 

I. INTRODUCTION 

With the increasing amount of information stored in 
relational databases, it is necessary to write SQL queries 
that execute faster. Anti-patterns in SQL are common 
mistakes that if avoided, can make a query executes faster. 
For example, when query-ing an indexed column, 
replacing the OR operator with the IN operator, will result 
in better perfor-mance, because the IN operator leverages 
the in-dex Thus, using the OR operator in this case, is an 
anti-pattem. 

SELECT u r 1 FROM pictures WHERE 
i d = 10 OR i d =20 
can be rewritten as 

SELECT u r 1 FROM pictures WHERE 
id IN (10,20) 

By detecting the anti-pattems in a query, we can rewrite 
it into a better version. In this paper, we approach the 
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problem as a multi-class multi-label classification 
problem. Our solution is schema-independent, meaning 
that the decision made by the neural network doesn't 
depend on the database logical or physical structure. The 
dataset used has been built from SQL queries provided by 
Sky-Server from Sloan Digital Sky Survey (SDSS). 

SkyServer, the portal from the SDSS catalog, provides 
data access tools for astronomers and sci-entific education. 
Through SkyServer, users can use the SQL language to 
query the Sloan Digi-tal Server database. Since 2001, the 
portal has seen more than 280 million SQL queries submit¬ 
ted by users and those queries have been opened to the 
public through the different data releases. We fetch 1 
million queries from SkyServer, that we filter, process and 
transform. The final dataset of usage contains 363616 
unique SELECT queries. 

Following a supervised learning approach, the SQL 
queries from SkyServer are used as input data; we 
manually label the data by associating each SQL query 
with a list of anti-pattems it con-tains. 

Our model is based on a convolutional neural network 
trained to classify a query into multiple categories. We use 
the one-hot encoding technique to encode the queries as 
word vectors. For encoding the anti-pattems classes we 
use a one dimensional tensor with each class represented 
as an in-teger. 

We explore some of the important work in the field of 
SQL anti-pattems detection in section 2. In section 3, we 
explain in details the process fol-lowed to build the 
dataset. Then, we discuss our model architecture in section 
4. hi section 5, we analyze the results from our 
experiments. Finally in the conclusion, we compare our 
work to the existing solutions and explore the possible 
future work. 

II. RELATED WORK 

Common mistakes in SQL has been already in the interest 
of researchers before the appearance of the ISO SQL-92 
standard, hi 1985, Welty studied how human factors can 
affect users in using SQL and found that user performance 
could be significantly improved. Later, Brass et al. started 
working on the automatic detection of logical errors in 
SQL queries and extended their work with the recognition 
of common semantic mistakes. They implemented the 
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SO I .1 in i tool which was able to au-toinatically identify 
these errors in (syntactically correct) SQL statements. The 
tool seems to be unsupported today. There is another 
online tool named SO T .Tin t, but it is a SQL beautifier. 

There are also books in this area. The Art of SQL and 
Refactoring SQL Applications pro-vide guidelines to write 
efficient queries, while the book of Bill Karwin collects 
antipattems that should be avoided. 

In a paper, Ahadi et al.,presented a large-scale analysis 
of students semantic mistakes in writing SQL SELECT 
statements. They collected data from over 2,300 students 
across nine years and summarized typical mistakes of the 
students. They found that most of the mistakes were made 
in queries which require a JOIN, a subquery or a GROUP 
BY operator. We argue that queries typ-ically use more 
complex syntax in practice com-pared to student projects. 
Hence, the situation can be even worse. 

In the realm of embedded SQL, Christensen et al. 
proposed a technique and a tool (JSA, Java String 
Analyzer) to extract string expressions from Java code 
statically. As a potential application of their approach, they 
check the syntax of dynami-cally generated SQL strings. 
They limit their ap-proach to the syntactic validation of the 
queries. 

Wassermann et al. propose a static string analy-sis 
technique to identify possible errors in dynam-ically 
generated SQL code. With the implemen-tation of a CFL- 
reachability algorithm they detect type errors (e.g., 
concatenating a character to an integer value). Their 
approach works with ex-tracted query strings of valid SQL 
syntax In a tool demo paper, they present their prototype 
tool called JDBC Checker. 

Recently, Anderson and Hills studied query 
constmction patterns in PHP. They analyzed query strings 
embedded in PHP code with the help of the PHP AiR 
framework. 

Quality assessment of embedded SQL was pro-posedby 
Brink et al. in 2007. They analyzed em-bedded query 
strings in PIVSQL, Cobol, and Vi-sual Basic programs 
while they propose a generic approach which could be 
applied to Java too. They investigate relationships which 
could be detected through embedded queries (e.g., access, 
dupli-cation, control dependencies) and they propose 
quantitative query measures forquality assess-ment. 

Many static techniques which try to deal with embedded 
query strings do it with the purpose of SQL injection 
detection. Yeole and Meshram pub-lished a survey of 
these techniques. SQL injection detection is different as 
the goal is specifically to determine whether a query could 
be affected by user input. 

Some papers also tackle SQL fault localization 
techniques. A dynamic approach was proposed by Clark et 
al. to localize SQL faults in database applications. They 
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provide command-SQL tuples to show the SQL statements 
executed at database-interaction points. 

A recent work of Delplanque et al. targets the database 
to assess the quality of the schema and to detect design 
smells in it. They implement a tool called DBCritics which 
can analyze PostgreSQL schema dumps and identify 
design smells such as missing primaiy keys or foreign key 
references. 

A tool which also has to be mentioned here is the 
Eclipse plugin called Alvor and JSA [17], this plug-in 
analyzes the string expressions in Java code. What is more, 
Alvor checks syntax correct-ness, semantics correctness, 
and object availability by comparing the extracted queries 
against its in-temal SQL grammar and by checking SQL 
state-ments against an actual database. 

III. DATASET 

3.1 Collecting the queries 

We start building our dataset, by fetching 1 mil-Hon 
successful SQL queries from the SkyServer catalog. 

SELECT TOP 1000000 statement 

FROM S q 1 L o g 
WHERE error =0 

Some of these queries need to be filtered out, in order to 
build a more focused dataset. 

3.2 Filtering 

From the fetched queries, we remove the dupli-cates, so 
the dataset contains unique queries only. 

allQueries = list(set( value s)) 

As we focus on query anti-pattems, we remove all of 
the non SELECT queries, 
import re 

allQueries = 1 i s t ( 

filter( 

lambdaitem: re.search! 
’’"select”, 
item . lower() 

), 

allQueries 

) 

) 

In the end, the dataset is reduce from 1000000 to 318188 
queries. 

3.3 Trans forming 

In order to eliminate irrelevant information and reduce the 
size of our dataset vocabulary, we re-place all of the 
schema-related terms contained in the queries with 
standard words. Thus the queries contain almost only 
standard SQL keywords. 
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SELECT name from students; 

will be transformed to 
SELECT column from table; 

3.4 Annotating 

Following a supervised learning approach, having SQL 
queries as input, we need to map each query to a set of 
anti-pattems as labels. 

Our work is based on 16 common anti-pattems. To each 
of the query, we match a single anti-pattem. In fact, a 
single query can contain several anti-pattems, but for 
simplicity purpose, we only consider the most dominant 
anti-pattem. We ex-plain in detail each anti-pattem in the 
Appendix section. 
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Fig.l: Dataset visualization 

IV. MODEL 

4.1 Data Encoding 

4.1.1 Queries 

Each SQL query in our current dataset is a list of words. 
Word representation methods gener-ally fall into two 
categories. The first consists of methods such as one-hot 
vectors. This method is problematic due to homonymy and 
polysemy words. The other category consists of using un- 
supervised learning method to obtain continuous word 
vector representations. Recent research re-suits have 
demonstrated that continuous word rep-resentations are 
more powerful. 

In this paper, we use word embedding based on 
word2vec (Mikolov et al., 2013). To encode the SQL 
queries of our dataset, we choose to use the pre-trained 
google word2vec embedding. The model is trained on 100 
billion words from Google News by using the Skip-gram 
method and maxi-mizing the average log probability of all 
the words using a softmax function. Our result model con¬ 
tains 123.852 tokens. 



Fig.2: Queries Embedding 
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Fig.3: Anti-patterns list 


4.1.2 Anti-pattems 

As our work is based on 16 Fixed anti-pattems, we encode 
the label data as ID Vector of integers. 

4.2 Convolutionnal Neural Network 

The convolution neural network is a state-of-the-art 
method to model semantic representations of sentences. 
The convolution action has been com-monly used to 
synthesize lexical n-gram informa-tion. In our model, we 
use three different convo-lutional filters with varying 
convolution window size to form parallel CNNs so that 
they can learn multiple type of embedding of local regions 
so as to complement each other to improve model accu¬ 
racy. The final output is the concatenation of the output of 
each. 
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V. EXPERIMENTS 

5.1 Settings 

For all our experiments, we use the Stochastic Gradient 
Descent optimization algorithm with a learning rate of 0.1 
and a weight decay of 0.95. We conduct the experiments 
with 50 epochs and we use mini-batches of size 64. We 
evaluate the model every 100 steps. We use google pre¬ 
trained word2vec thus the dimension of each word vector 
is 300. 

We study the sensivity of the proposed model to the 
convolutional region size, the number of con-volutional 
feature and the dropout rate. We found that we achieve the 
best performance when we use the settings values listed in 
the Table I. 

Our model is developed in Python with Tensor-flow and 
Numpy libraries. The experiments are conducted on a 
MAC OS PC with 2.9 GHz Intel Core i7 processor and 8 
GB RAM. 
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Fig.4: Experimental Settings 


5.2 Validation method 

For validating our model we use the iterated K-Fold 
validation model. 

The dataset is split into 10 mini-datasets, which are used 
to validate each subset repeatedly. 

5.3 Results 

We compare our results with Sqlcheck . Sqlcheck is a lint 
tool that relies on syntax checking logic, to detect anti- 
patterns in SQL queries. We run SQL check on each of our 
dataset query, and store the results, which we then 
compare to ourCNN re-sults. 
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Fig.5: Experimental Settings 


After running the experiments, our model can detect 
anti-pattem in a query with an accuracy of 83.2. 

VL CONCLUSION AND FUTURE WORK 

In this work, we experimented using text classi-fication 
techniques to detect anti-pattems in SQL queries. The 
model uses a neural network with a custom dataset built 
from SkyServer catalog SQL queries. Experimental results 
demonstrate that, our model is quite accurate and can 
outperform lint syntax checking software. 

For the future, we could focus on rewriting queries 
based on the anti-pattems detected. 


[Vol-6, Issue-4, Apr-2019] 
ISSN: 2349-649S(P) / 2456-1908(0) 
ACKNOWLEDGEMENTS 

I would like to express my gratitude to my super-visor and 
corresponding author on this work, Xie Hongwei PhD, 
who has been a great mentor dur-ing the years of research. 

I would also like to thank my friend Miguel Kakanakou, 
for his very helpful advice and ex-perience sharing during 
the whole process of this work. 

APPENDIX 

Anti-pattems explanation select * 

When you SELECT *, you’re often retrieving more 
columns from the database than your appli-cation really 
needs to function. This causes more data to move from the 
database serverto the client, slowing access and increasing 
load on your ma-chines, as well as taking more time to 
travel across the network. 

Consider a scenario where you want to tune a query to a 
high level of performance. If you were to use *, and it 
returned more columns than you actually needed, the 
server would often have to perform more expensive 
methods to retrieve your data than it otherwise might. 

When you SELECT *, it’s possible to retrieve two 
columns of the same name from two different tables. This 
can often crash your data consumer null usage 
NULL is not the same as zero. A number ten greater than 
an unknown is still an unknown. NULL is not the same as 
a string of zero length. Combining any string with NULL 
in standard SQL returns NULL. NULL is not the same as 
false. Boolean expressions with AND, OR and NOT also 
produce results that some people find confusing not null 
usage 

When we declare a column as NOT NULL it should be 
because it would make no sense for the row to exist 
without a value in that column. 

string concatenation 

You may need to force a column or egression to be non- 
null for the sake of simplifying the query logic, but you 
don’t want that value to be stored. Use COALESCE 
function to construct the con-catenated expression so that a 
null-valued column doesn't make the whole expression 
become null. 

group by usage 

Every column in the select-list of a query must have a 
single value row per row group. 

order by rand usage 

Sorting by a nondeterministic expression (RAND()) means 
the sorting cannot benefit from an index 

pattern matching usage 
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The most important disadvantage of pattern-matching 
operators is that they have poor per-formance. A second 
problem of simple pattern-matching using LIKE or regular 
expressions is that it can find unintended matches. 

spaghetti query alert 

Split up a complex spaghetti query into several simpler 
queries 

reduce number of joins 

Too many JOINs is a symptom of complex spaghetti 
queries 

eliminate unnecessary distinct 

Too many DISTINCT conditions is a symptom of complex 
spaghetti queries. 

imp licit column usage 

Although using wildcards and unnamed columns satisfies 
the goal of less typing, this habit creates several hazards. 
This can break application refac-taring and can harm 
performance 

having clause usage 

Rewriting the query’s HAVING clause into a pred-icate 
will enable the use of indexes during query processing. 

nested sub queries 

Rewriting nested queries as joins often leads to more 
efficient execution and more effective opti-mization 

or usage 

Consider using an IN predicate when querying an indexed 
column 

union usage 

Unlike UNION which removes duplicates, UNION ALE 
allows duplicate tuples. 

distinct &join usage 

The DISTINCT keyword removes duplicates after sorting 
the tuples. Instead, consider using a sub query with the 
EXISTS keyword, you can avoid having to return an entire 
table. 
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